Forecasting the Evolution of Dynamical Systems from Noisy Observations 
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We consider the problem of designing almost optimal predictors for dynamical systems from a 
finite sequence of noisy observations and incomplete knowledge of the dynamics and the noise. 
We first discuss the properties of the optimal (Bayes) predictor and present the limitations of 
memory-free forecasting methods, and of any finite memory methods in general. We then show 
that a nonparametric support vector machine approach to forecasting can consistently learn the 
optimal predictor for all pairs of dynamical systems and bounded observational noise processes that 
possess summable correlation sequences. Numerical experiments show that this approach adapts the 
memory length of the forecaster to the complexity of the learning task and the size of the observation 
sequence. 



Our goal is to design almost optimal predictors for dy- 
namical systems conditioned only on a finite sequence 
of observed noisy measurements of its evolution and im- 
perfect knowledge of the dynamics and noise. Suppose 
that F : X ^ X is a map on a compact subset X C R*^ 
such that the associated dynamical system T> := (F')i>o, 
where = F o F'^~^, has an ergodic measure /i. For 
example, if X is compact and F is continuous such a 
measure always exists. Moreover, let £ := (£i)i>o be an 
R'^-valued i.i.d. process with respect to the distribution 
1/ that is independent of the process V. We assume that 
all observations of the dynamical system are perturbed 
by the process £, i.e. all observations are of the form 
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where xq is the unknown initial point of the trajectory. 
Let us now assume that we have a sequence of observa- 
tions zi, . . . , Zm and we wish to predict the observation 
Zm+i , or the true state Xm+i , after some time I as best as 
possible. Then one possible formalization of this problem 
is to look for a minimizer called the Bayes forecaster, 
of the risk functional 
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which describes the average discrepancy (measured in the 
II • llp-norm) between the predictions of / : R""' R'* and 
the observations (we have made a time-shift by m -f- 1 
in order to handle non-invertible maps F). To attain 
this goal we want to design a learning method £, which 
assigns to each observed trajectory r„ = {zq, . . . , Zn}, 
n> m, a. forecaster /t„ and which is consistent, i. e. 
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holds in probability, where 7?,*^ ; = Ti-m.iifm i) min- 
imum risk (the Bayes risk). 

Most classical forecasting methods use a Markov state- 
space approach to modeling dynamic systems and re- 



quire a model of both the flow F and the observa- 
tional noise v. These methods usually attempt to esti- 
mate the probability density function (PDF) of the state 
based on all the available information, i. e. p{Xm\Z™), 
where := {Zi, . . . , Zm) and (Zi) denotes the stochas- 
tic process that generates the observations. Then, an 
optimal (with respect to any criterion) forecaster may 
be obtained from the PDF, including E(X„i+;|Z{") = 
/ p{X„i\Z'^)F'' {X„i)dX„i. Known as nonlinear filters, 
these methods estimate recursively in time this distri- 
bution and consist of essentially two steps: a prediction 
step, that uses the system model to propagate the state 
PDF to the next observation time, and an update step, 
that uses the latest observation to modify the prediction 
PDF using Bayes' rule. Except in a few special cases, 
including linear Gaussian state space models (Kalman 
filter) and hidden finite-state space Markov chains, the 
recursive propagation of the posterior PDF cannot be de- 
termined analytically [H . Consequently, various approxi- 
mations strategies to the optimal solutions have been de- 
veloped. The most popular algorithms, the extended and 
the unscented Kalman filter, rely on anlytical approxima- 
tions of the flow or/and finite moment approximations of 
the posterior PDF. Alternatively, sequential Monte Carlo 
methods, have conceptually the advantage of not being 
subject to the assumptions of linearity or Gaussianity in 
the model, but are computationally very expensive and 
suffer from the degeneracy of the algorithm [3| . 

Notwithstanding their significant merits, there are 
many difficulties in these approaches to forecasting. 
First, due to their reliance on system and noise mod- 
els they are sensitive to model errors. Second, instead 
of solving the function estimation problem directly, for 
which the available information might suffice, they es- 
timate densities as a first step, which is a harder and 
more general problem that requires a large number of 
observations to be solved well. Third, by relying on a 
Markov state-space approach to forecasting they heavily 
restrict the class of functions available for forecasting. In 
order to further appreciate this aspect, and to better un- 
derstand our proposed approach, let us now describe the 
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FIG. 1: Comparison of three, memoryless, forecasting strate- 
gies for one-step ahead predictions for the logistic map and 
large Gaussian noise {a = 0.2): the true dynamical behav- 
ior, F (continuous line), the best forecaster (dashed), and F 
composed with the best memory-free denoiser, /i o (dotted). 

structure of the optimal forecasters in more detail for the 
special case p = 2. If the noise has no systematic bias, 
i.e. Ei^Si = 0, simple algebra then shows that ; is not 
only the optimal forecaster for the observable state Zm+i 
but also for the true state Xm+i ■ Moreover, it is well 
known that f*^i — E(Zm+;|Z™), but this closed form 
does in general not solve the issue of actually computing 
fm I since for many cases this computation is intractable 
even with perfect knowledge oi F, fj, and v. However, if 
the noise has a density d with respect to the Lebesgue 
measure, ; can be expressed by a more tractable for- 
mula, 

/m,i(^l,---,2,„) = (4) 

4, ^(zi - F{x)) ■ ■ ■ d{zm - F^{x))F"'+'{x)d^i{x) 
i?(zi - F{x)) ■ ■ ■ i?(z™ - F-'{x))d^ji{x) 

When perfect knowledge of the system is available, the 
above equation describes how to build the optimal pre- 
dictor. Even though these assumptions are rarely met 
in practice the above result points to a number of inter- 
esting and very general conclusions with regard to fore- 
casting the future of a dynamical system based on noisy 
observations: 

1) In general the flow F is not the optimal predictor. 
Indeed Eq. ^ clearly shows that the flow F is in general 
not the optimal one-step predictor based only on a noisy 
estimate of the present state of the system, i.e. F ^ fl^. 
A good example is provided by the logistic map, defined 
as F{x) = 1 — ax"^ , x £ [—1, 1], which is known to be 
ergodic and has an invariant measure given by /i(a;) = 

As we can clearly see from Fig. ([TJ, 
the true dynamical behavior F (continuous line) and the 
best forecaster fl ^ (dashed line) disagree. However, in 
the absence of observational noise, we obviously obtain 
P — fii from Eq. ([2]), i.e. F is the optimal, memoryless, 
one-step ahead predictor. 

2) Recursive forecast is worse than direct forecast. 



Since in general for I step ahead predictors we have 
fm I 7^ (/m i)' ■^fi see that iterating one-step-ahead fore- 
casts is in general worse than directly forecasting Z-steps 
ahead. In Fig. where we plot the direct two-step 
ahead forecaster as well as the recursive forecaster ob- 
tained by iterating the optimal one-step ahead Bayes pre- 
dictor, illustrates this effect for the logistic map. In gen- 
eral, as the forecasting time increases the iterated fore- 
casters become more and more inadequate. Of course, 
for noiseless observations direct and recursive predictors 
have the same forecasting performance. 
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FIG. 2: Comparison of the direct (continuous line) and recur- 
sive forecaster (dashed line) for two-step ahead predictors for 
the logistic map with large Gaussian noise with a — 0.2. Also 
shown is F^ composed with the best memory-free denoiser, 
/i,o (dotted line). 

3) Forecasts based on denoising the present state are 
not optimal. By choosing I = in Eq. ^ we obtain an 
optimal estimate {optimal denoiser) of the present state 
of the dynamical system based on a history of length m 
[H . Now a common forecasting strategy is to estimate the 
present system state first and then apply the dynamics i^' 
to this estimate. It should be is clear from our discussion 
so far that even if the estimate of the present state is op- 
timal it still approximates the true state: therefore, this 
forecaster is not the optimal l-step ahead forecaster based 
on histories of length m. The dotted line in Fig. ([1]), 
which is F composed with the best memory-free denoiser, 
/j* g, and the one in Fig. dU, which is F"^ composed with 
fiQ, illustrate the principal limit of memory-free, and of 
any finite memory denoising, followed by propagating the 
dynamics. 

4) Memory improves forecasting performance. Indeed, 
building one-step ahead predictors using the past m 
states is a minimization of the risk functional over the 
space of all measurable functions / : R™ M. This 
function space is a subset of the space of measurable func- 
tions / : M™+^ M which is the hypothesis space for the 
best one-step ahead predictor using the past m+1 states. 
Therefore, we necessarily have TZ^ ; > 'R-*mj^i i, i.e. by in- 
creasing the memory we can decrease the best possible 
prediction error: building forecasters that use informa- 
tion from the recent past of the dynamical system can 
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History 


Forecaster 


I = 1 


I = 2 


I = 3 


m = 1 


/m 

F o flo 
F 


0.0171 
0.0171 
0.0222 


0.0774 
0.0792 
0.1716 


0.2082 
0.2605 
3.8374 


m = 2 


/2,1 

F o flo 


0.0132 
0.0132 


0.0543 
0.0555 


0.1584 
0.1868 



TABLE I: Risks of the optimal predictor, /j* i, the denoised 
true dynamic, Foflg, and the true dynamic F for histories of 
length m = 1, 2 and for the logistic map with Gaussian noise 
with a = 0.05. 



substantially improve the forecasting performance. In- 
deed, as Table 1 shows, using histories of increased length 
reduces the Bayes risk for predictors of the logistic map. 

The last two features clearly show the limitations of 
memory-free forecasting methods, and of any finite mem- 
ory methods in general, because they demonstrate that 
optimal estimators are non-Markovian in the original 
state space even though the underlying dynamical sys- 
tem is deterministic. Unfortunately, it is not clear how 
prior knowledge about the system can be used to build 
non-Markovian forecasters. For this reason, we propose 
to build nonparametric predictors without making any 
assumptions about the form of the dynamics or the noise, 
while only assuming that the process described by ([T]) is 
bounded, ergodic, and has sufficiently fast decay of cor- 
relations. While nonparametric methods that are proven 
to work in a certain statistical sense different from ([2]) 
for arbitrary, unknown stationary ergodic processes (Zi) 
exist , these methods require very large data segments 
for acceptable precision. Approaches for forecasting goals 
closer to ^ were considered by, e.g., d, 0| but unfortu- 
nately these methods require certain mixing conditions 
that cannot be satisfied by dynamical systems. 

The approach we propose uses support vector machine 
(SVM) forecasters [Sj] . For simplicity we only describe 
the least squares loss, p = 2, and memory less, to = 1, 
one-step ahead forecasters, 1 = 1, and consider d = 1, but 
generalizations are straightforward. An SVM forecaster 
assigns to each finite sequence T„ a function fn,i,i,\,'y ■ 
X ^M. from a reproducing kernel Hilbert space (RKHS) 
M that solves 
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where A > is a free regularization parameter. Here, we 
choose the RKHS of a Gaussian kernel fc^ : X x X ^ R 
defined by kj{x,x') = exp(— 7^||a: — a;'|||), where 7 > 
is a free parameter called the width, but other choices 
of kernels are possible as well. It can be shown that 
the function minimizing the regularized empirical er- 
ror (O exists, is unique, and has the form fn,i,i.\,-^ = 



Yll^^i 0'.*kj{zi, •), where aj, . . . , a* is the unique solution 
of the well-posed linear system in M" 

{\nI + K)a* = Z. (6) 

Here / is the nxn identity matrix, K is the nxn matrix 
whose entry (i, j) is kj{zi, Zj), where i, j = 0, 1, . . . , n — 1, 
and Z is the n x 1 vector (zi, . . . , z„)"^ 

Unfortunately, standard learning theory cannot make 
conclusions on the behavior of '^i,i(/n,i,i,A,7)7 since the 
input/output pairs (zi,Zi_|_i) are clearly not i.i.d. sam- 
ples. On the other hand, because the observational noise 
process is weakly mixing and the dynamical system is er- 
godic, the stochastic process Zi = {F\ei,ei+i) defined 
by these pairs is ergodic. Thus, according to Birkhoff's 
ergodic theorem, it satisfies the following law of large 
numbers (LLN) 
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/i ® z^- almost everywhere, for all h G Li(p ® v) defined 
hy h o Zi = h{zi, ZiJ^i). In particular, this holds for 
h{zi,z.i,+i) = {zi+i - f{zi)Y for any / e H^. Therefore, 
the results in [10| show that there exists a null-sequence 
(A„), depending on V and S, for which the SVM fore- 
caster is consistent. We keep 7 constant, but we could 
as well find a sequence of the regularization parameter 
and the kernel width, (A„,7„), ensuring consistency as 
long as hm„^oo A„7^ = 0. This result even holds for 
unbounded, but integrable, i.i.d. noise processes. 

In order to a-priori determine for a given sample size 
n the regularization sequence (A„,7„), the convergence 
speed of the LLN that the observation-generating pro- 
cess satisfies is necessary. However, the general negative 
results from [11| strongly suggest that there exist neither 
a universal, i.e. system and noise independent, sequence 
(A„,7n) nor any other universal forecaster. However, the 
situation changes dramatically, if one restricts consider- 
ations to dynamical systems whose stochasticity can be 
described by, e.g., convergence rates of the correlations 



cor(-!/), n) := j -0 • (/s o F" d/i — y ip d^ j ipdfi (8) 

for n — > 00. Indeed, we recently showed the existence of 
a universal sequence (A„,7„) that yields an SVM which 
is consistent for all pairs of ergodic dynamical systems 
and bounded, i.i.d observational noise processes satisfying 
I cor(i/', n)| < 00 for all Lipschitz continuous ip 
and (fi [13] ■ To be more specific, the corresponding SVM 
is consistent if, e.g., we use the least squares loss and 
sequences A„ := n^" and 7„ := , n > 1, for fixed 
a and f3 satisfying 3a > 8d/3 > and 11a + 4(3 < 2. 
Moreover, if the noise process is also centered then this 
SVM actually learns to forecast the next true state. 

Let us now apply this nonparametric approach to pre- 
dicting the evolution of the x-coordinate of the Lorenz 
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FIG. 3: Performance of SVM forecasters for / — 5 (left) and 
I = 25 (right) steps ahead forecasts of the x coordinate of 
the Lorenz dynamics with Gaussian noise with a = 0.05. The 
circles and squares report, as a function of the memory length 
m, the average risk of 25 SVM forecasters built from different 
samples of size n = 800 and n — 1600 for each m. 



system described by the following set of differential equa- 
tions, X = a{y — x), y = bx — y — xz, z — xy — cz, where 
the parameters are set at the standard values a — IQ, 
b = 28, and c = 8/3. The evolution is sampled at 
time steps of r = 0.01 and the observational error is 
i.i.d Gaussain noise. We compute SVM forecasters using 
memories of increasing length and for training data of 
two different sizes n = 800 and n = 1600. Since in esti- 
mating the convergence speed of ([7]) we are using a loose 
concentration result, a suitable regularization parame- 
ter and kernel width sequence cannot be chosen a-priori. 
Hence, we have adopted a grid search in (A, 7) space and 
a 4- fold cross- vahdation technique 0] to choose (A„,7„) 
for a given sample size n. Finally, we use (A„,7„) for an 
estimate /n,i.i,A„.7„ constructed from ^ using the whole 
sample r„ (to simplify notation, we henceforward omit 
the dependence of fn,i,i on the regularization parame- 
ters (A„,7„)). This approach adapts the regularization 
parameters and the complexity of the SVM forecasters 
to the amount of available empirical data. The risk of 
each forecaster is estimated by the prediction error over 
a large test set (10^ input/output pairs) chosen indepen- 
dent of the training set T„. Note that 7?,„i,;(/) > a^, 
where a is the standard deviation of the noise. 

As Fig. ^ shows for / e {5,25}, by increasing the 
memory of the predictor we obtain forecasters with im- 
proved performance. However, for a given sample size 
n, there is an optimal memory length m and increasing 
memory length beyond this value produces poorer fore- 
caster due to their increased complexity for the available 
data. Moreover the memory of the best forecaster in- 
creases with sample size from m « 9 for n = 800 to 
m w 18 for n = 1600, as shown for 5-step ahead forecast- 
ers, while the memory increases from m ~ 9 for n = 800 
to m w 12 for n = 1600, for 25-step ahead forecasters. 
This reflects the fact that with increased information we 
can build more complex and, therefore, better predic- 
tors. Since forecasting further into the future is a more 



difficult problem, the performance of the predictor de- 
creases when we keep the same sample size but attempt 
to make 25-steps ahead predictions. Furthermore, as the 
right plot in Fig. [3] reveals, for the same sample size n 
the memory of the best forecaster reduces in order to 
accommodate the increased complexity of the learning 
task. 
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FIG. 4: Performance of direct and recursive SVM forecasters 
for / = 25 steps ahead forecasts of the x coordinate of the 
Lorenz dynamics with Gaussian noise with a = 0.20 (left) 
and a = 0.05 (right). The circles and squares report, as a 
function of the memory length m, the averages forecasting 
risk of 30 direct and recursive SVM forecasters built from 
different samples of size n — 800 for each m. 

In the limit n 00 direct forecasts are better than re- 
cursive forecasts because the SVM forecaster approaches 
the Bayes forecaster. However, for finite sample sizes 
this is not always the case. Indeed, as Fig. ^ shows for 
n = 800 there is a memory size at which the risk of the 
direct forecaster, /„,m,25, is larger than that of the re- 
cursive, /^^^ I, 25-steps ahead forecaster. Moreover, the 
crossover depends on the amount of noise and increases 
as noise variance increases while keeping the same sam- 
ple size n. By increasing the sample size though, the 
crossover increases such that in the limit of infinite sam- 
ple size direct forecasters always perform better than re- 
cursive forecasters, as is expected. 

To conclude, we have described the assumptions under 
which non-parametric SVM forecasters can consistently 
learn the optimal predictors. For example, for bounded 
noise, SVM forecasters are consistent for all pairs of dy- 
namical systems and observational noise processes that 
possess summable correlation sequences. Hence, the 
SVM forecasters possess a weak form of universality 
for a large class of stochastic processes. This includes 
systems with smooth uniformly expanding dynamics or 
smooth hyperbolic dynamics, systems perturbed by dy- 
namic noise, as well as " parabolic" or " intermittent" sys - 
tems which have a polynomial decay of correlations [l3|. 
Remarkably, for some dynamical systems it seems pos- 
sible to decide on the summability of the correlation se- 
quence from observations. Indeed, for piecewise expand- 
ing maps rigorous estimates of the asymptotic rate of 
decay of correlations for a given function is numerically 
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feasible 

We also notice that in the presence of observational 
noise the task of forecasting is different from the task of 
modelling the underlying nonlinear dynamics from data. 
Indeed, finding the simplest model consistent with the 
observations, which is the ultimate goal of modelling [l^, 
is highly unsuitable for forecasting. This is clearly illus- 
trated by noticing the significant increase of forecasting 
skill in Fig. ([3]) when using a memory length which is 
much larger than the minimal time delay embedding nec- 
essary to reconstruct a modelling phase space equivalent 
to the original state space of the Lorenz system. 

Finally, we note that our analysis remains valid for sta- 
tionary nonergodic processes. Since a nonergodic station- 
ary process has an ergodic decomposition, a realization 
of the time series falls with probability one into an invari- 
ant event on which the process is ergodic and stationary. 
For this reason, the proposed learning algorithm can be 
applied to a time series sequence generated by that event 
as though it were the process universe. 
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